Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove needless last_root for better reclaims #8148

Merged
merged 3 commits into from
Feb 12, 2020

Conversation

ryoqun
Copy link
Member

@ryoqun ryoqun commented Feb 6, 2020

Problem

last_root wasn't correctly restored from snapshot and its related range check is bad for purge_zero_lamport_account, which ultimately causes dangling storage entries some times when restoring from snapshot.

(a lot of original investigation report for #8130 is moved to new PR: #8176)

Summary of changes

After careful code scrutinization, I'm pretty sure the last_root thing can go away.

Just remove last_root

Background

I found this as part of investigation for #8130.

Part of #7167.

@codecov
Copy link

codecov bot commented Feb 6, 2020

Codecov Report

Merging #8148 into master will decrease coverage by <.1%.
The diff coverage is 100%.

@@           Coverage Diff            @@
##           master   #8148     +/-   ##
========================================
- Coverage      82%     82%   -0.1%     
========================================
  Files         249     249             
  Lines       53530   53503     -27     
========================================
- Hits        43913   43887     -26     
+ Misses       9617    9616      -1

@sakridge
Copy link
Member

sakridge commented Feb 6, 2020

Seems right to me.

Copy link
Member

@mvines mvines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome detective work as always @ryoqun! Any chance you could write a test for this?

@ryoqun
Copy link
Member Author

ryoqun commented Feb 8, 2020

Sad status report: It seems the description is wrong. I can't create a valid test demonstrating the described error condition... And I cannot find another alternative bad scenario at the moment....

This bug is hard. ;)

@ryoqun ryoqun force-pushed the last-root-restore branch from 4132eb6 to 5cdec64 Compare February 8, 2020 16:08
@ryoqun ryoqun force-pushed the last-root-restore branch from 5cdec64 to 3120504 Compare February 8, 2020 16:09
@@ -1156,9 +1153,7 @@ impl AccountsDB {
dead_slots
}

fn cleanup_dead_slots(&self, dead_slots: &mut HashSet<Slot>, last_root: u64) {
// a slot is not totally dead until it is older than the root
dead_slots.retain(|slot| *slot < last_root);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized all inputs of this function is ensured to be older or equal to last_root.

Notice the or equal to part, this assertion must be *slot <= last_root for purge_zero_lamport_accounts. But, we don't need this to begin with, as said above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, why are they guaranteed to be older than last_root?

Copy link
Member Author

@ryoqun ryoqun Feb 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can explain this by reading the code from here upwards the call graph.

Firstly, this cleanup_dead_slots(dead_slots, last_root) is only called from handle_reclaims(reclaims).
dead_slots are derived from reclaims by calling remove_dead_accounts() inside handle_reclaims(), which does just simple 1-to-1 storage -> slot mapping. Thus, we just need to explain in terms of reclaims further on.

handle_reclaims() are called only from purge_zero_lamport_accounts()(1) and store_with_hashes() (2). These are the only 2 code paths reaching to this function.

(1) For the purge_zero_lamport_accounts() cod epath, reasoning is simple:

The reclaims is a collection of results of accounts_index.purge(&pubkey), which in turn the sum of get_rooted_entries() for all such zero lamport pubkeys. So, reclaims (thus, dead_slots) are always rooted slots only. This is true for normal operation because the validator calls add_root() as it roots slots.

Also this is also true for snapshot restoration because generate_index() is called before purge_zero_lamport_accounts() and generate_index() does accounts_index.roots.insert(*slot_id).

(2) For the store_with_hashes(), reasoning is a bit longer:

The reclaims is ultimately only generated at accounts_index.update(). There is intermediately fns like accounts_db.update_index() and accounts_index.insert(), but the destination of this call graph is same.

Next, accounts_index.update(slot, ...) creates reclaims by concatenating two sets:
A: can_purge-able (= slot < max_root) entries
B: Old entries in slot by replacing with updated one

A is obviously fine because max_root is always member of accounts_index.roots. B is ensured not to be dead at remove_dead_accounts() because the case B is always inserting an updated alive entry into the slot.

So, as a normal operation, cleanup_dead_slots() never removes anything such that slot > last_root.

And leaving this line's retain causes halfway-deallocated state because cleanup_dead_slots() is only called once ever for each given slot. That's because later invocations of remove_dead_accounts() would remove any slots whose storage are already removed.

Thanks for reading a rather long comment. :)

So I think this retain does add little value and complicates code reading with seemingly overlapping and spread responsibility for ensuring purge-able slots at caller-site or at callee-site.

As a second thought, if you're a bit concerned for future changes that might break these assumptions, this PR can be changed to assert! that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Thanks for the description. I think it can be removed.

@ryoqun ryoqun changed the title [wip] Restore last_root to fix unintended storage delete [wip] Remove last_root Feb 8, 2020
@ryoqun ryoqun added the work in progress This isn't quite right yet label Feb 8, 2020
@@ -161,11 +154,6 @@ impl<T: Clone> AccountsIndex<T> {
}

pub fn add_root(&mut self, slot: Slot) {
assert!(
(self.last_root == 0 && slot == 0) || (slot >= self.last_root),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think removing this assertion is ok to remove last_root, first of all.

@ryoqun ryoqun added the security Pull requests that address a security vulnerability label Feb 10, 2020
@ryoqun ryoqun changed the title [wip] Remove last_root Remove needless last_root for better reclaims Feb 10, 2020
@ryoqun ryoqun removed the work in progress This isn't quite right yet label Feb 10, 2020
@mvines mvines added the v0.23 label Feb 10, 2020
Copy link
Member

@mvines mvines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sakridge is the best person to approve this PR, he's more familiar with the details in this part of the codebase than I

@ryoqun ryoqun removed the v0.23 label Feb 12, 2020
@ryoqun
Copy link
Member Author

ryoqun commented Feb 12, 2020

Removing the v0.23 label, as this is not strictly fix for #8130.

@ryoqun ryoqun merged commit 5872746 into solana-labs:master Feb 12, 2020
@ryoqun
Copy link
Member Author

ryoqun commented Feb 12, 2020

@sakridge Thanks for checking this! I've just merged!

@ryoqun ryoqun added the v0.23 label Mar 3, 2020
@ryoqun
Copy link
Member Author

ryoqun commented Mar 3, 2020

This got needed in the v0.23 branch for #8436.

mergify bot pushed a commit that referenced this pull request Mar 3, 2020
* Restore last_root to fix unintended storage delete

* Remove last_root thing altogether

* Remove unneeded test...

(cherry picked from commit 5872746)
solana-grimes pushed a commit that referenced this pull request Mar 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
security Pull requests that address a security vulnerability
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants